Abstract

We study the problem of recovering the subspace spanned by the first $k$ principal components of $d$-dimensional data under the streaming setting, with a memory bound of $O(kd)$. Two families of algorithms are known for this problem. The first family is based on the framework of stochastic gradient descent. Nevertheless, the convergence rate of the family can be seriously affected by the learning rate of the descent steps and deserves more serious study. The second family is based on the power method over blocks of data, but setting the block size for its existing algorithms is not an easy task. In this paper, we analyze the convergence rate of a representative algorithm with decayed learning rate (Oja and Karhunen, 1985) in the first family for the general $k > 1$ case. Moreover, we propose a novel algorithm for the second family that sets the block sizes automatically and dynamically with faster convergence rate. We then conduct empirical studies that fairly compare the two families on real-world data. The studies reveal the advantages and disadvantages of these two families.

1 Introduction 

For data points in $\mathbb{R}^d$, the goal of principal component analysis (PCA) is to find the first $k \ll d$ eigenvectors (principal components) that correspond to the top-$k$ eigenvalues of the $d \times d$ covariance matrix. For a batch of stored data points with a moderate $d$, efficient algorithms like the power method (Golub and Van Loan, 1996) can be run on the empirical covariance matrix to compute the solution.

In addition to the batch algorithms, the stream setting (streaming PCA) has attracted much research attention in recent years (Arora et al., 2012; Mitliagkas et al., 2013; Hardt and Price, 2014). Streaming PCA assumes that each data point $x \in \mathbb{R}^d$ arrives sequentially and that it is not feasible to store all data points. If $d$ is moderate, the empirical covariance matrix can again be computed and fed to an eigenproblem solver to compute the streaming PCA solution. When $d$ is huge, however, it is not feasible to store the $O(d^2)$ empirical covariance matrix. The situation arises in many modern applications of PCA. Those applications call for memory-restricted streaming PCA, which will be the main focus of this paper. We shall consider restricting to only $O(kd)$ memory usage, which is of the same order as the minimum amount needed for the PCA solution. In addition, we aim to develop streaming PCA algorithms that can keep improving the goodness of the solution as more data points arrive. Such algorithms are free from a pre-specified goal of goodness and match the practical needs better.

There are two measurements for the goodness of the solution. One is the reconstruction error that measures the expected squared error when projecting a data point to the solution, which is based on the fact that the actual principal components should result in the lowest reconstruction error. The other is the spectral error that measures the difference between the subspace spanned by the solution and the subspace spanned by the actual principal components, which will be formally defined in Section 2. The spectral error enjoys a wide range of practical applications (Sa et al., 2015). In addition, note that when the $k$-th and $(k+1)$-th eigenvalues are close, the solution that wrongly includes the $(k+1)$-th eigenvector instead of the $k$-th one may still reach a small reconstruction error, but the spectral error can be large. That is, the spectral error is somewhat harder to knock down and will be the main focus of this paper.
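
The contrast between the two error measures can be checked numerically. The following is a minimal sketch, assuming a hypothetical diagonal covariance matrix whose second and third eigenvalues nearly tie; the helper names `spectral_error` and `reconstruction_error` and all numbers are illustrative and are not taken from the experiments in this paper.

```python
import numpy as np

def spectral_error(U, Q):
    """sin^2 of the largest principal angle between span(U) and span(Q).

    Both U and Q are assumed to have orthonormal columns.
    """
    # The cosine of the largest principal angle is the smallest singular value of U^T Q.
    s = np.linalg.svd(U.T @ Q, compute_uv=False)
    return 1.0 - s.min() ** 2

def reconstruction_error(A, Q):
    """Expected squared residual E||x - Q Q^T x||^2 for zero-mean x with covariance A."""
    return np.trace(A) - np.trace(Q.T @ A @ Q)

# Hypothetical covariance: the 2nd and 3rd eigenvalues are nearly tied.
A = np.diag([0.5, 0.3, 0.299, 0.01])
I = np.eye(4)
U = I[:, [0, 1]]        # true top-2 eigenvectors
Q_wrong = I[:, [0, 2]]  # picks the 3rd eigenvector instead of the 2nd

excess = reconstruction_error(A, Q_wrong) - reconstruction_error(A, U)
print("excess reconstruction error:", excess)             # the eigenvalue gap, about 0.001
print("spectral error:", spectral_error(U, Q_wrong))      # the two subspaces differ in one direction
```

In this toy case the wrong subspace loses only the eigenvalue gap in reconstruction error, while its spectral error is as large as it can be.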

There are several existing streaming PCA algorithms, but not all of them focus on the spectral error and meet the memory restriction. For instance, Karnin and Liberty (2015) proposed an algorithm which considers the spectral error, but its space complexity is at least $\Omega(kd \log n)$, where $n$ is the number of data points received. Based on Warmuth and Kuzmin (2008), Nie et al. (2013) proposed an algorithm along with a regret guarantee on the reconstruction error, but not the spectral error, and its space complexity grows in the order of $d^2$. Arora et al. (2013) extended Arora et al. (2012) to derive a convergence analysis for minimizing the reconstruction error along with a memory-efficient implementation, but the space complexity is not precisely guaranteed to meet $O(kd)$. That is, those works do not match the focus of this paper.

There are two families of algorithms that tackle the spectral error while respecting the memory restriction: the family of stochastic gradient descent (SGD) algorithms for PCA, and the family of block power methods. The SGD family solves a non-convex optimization problem that minimizes the reconstruction error, and applies SGD (Oja and Karhunen, 1985) under the memory restrictions to design streaming PCA algorithms. Interestingly, although the non-convex problem does not match standard convergence assumptions of SGD (Rakhlin et al., 2012), minimizing the reconstruction error for the special case of $k = 1$ allows Balsubramani et al. (2013) to derive spectral-error guarantees on the classic stochastic-gradient-descent PCA (SPCA) algorithm (Oja and Karhunen, 1985). Recently, Sa et al. (2015) derived a spectral-error minimization algorithm for the general $k > 1$ cases based on SGD along with strong theoretical guarantees. Nevertheless, different from Balsubramani et al. (2013), Sa et al. (2015) require a pre-specified error goal, which is taken to determine a fixed learning rate of the descent step. The pre-specified goal makes the algorithm inflexible in taking more data points to further decrease the error. Furthermore, the fixed learning rate is inevitably conservative to keep the algorithm stable, but the conservative nature results in slow convergence in practice, as will be revealed from the experimental results in Section 5.

The other family, namely the block power methods (Mitliagkas et al., 2013), extends the batch power method (Golub and Van Loan, 1996) for the memory-restricted streaming PCA by defining blocks (periods) on the time line. The key of the block power methods is to efficiently compute the product of the estimated covariance matrices in different blocks. The product serves as an approximation to the power of the empirical covariance matrix, which is a core element of the batch power method. This family could also be viewed as mini-batch SGD algorithms but with a different update rule from the SGD family. The original block-power-method PCA (BPCA; Mitliagkas et al., 2013) is proved to converge under some restricted distributions, which is later generalized by Hardt and Price (2014) to a broader class of distributions. The convergence proof of BPCA in both works, however, depends on determining the block size from the total number of data points or a pre-specified error goal, which again makes the works inflexible for further decreasing the error with more data points.

From the theoretical perspective, SPCA lacks a convergence proof for the general $k > 1$ case that depends on neither a pre-specified error goal nor a fixed learning rate, and it is non-trivial to directly extend the fixed-learning-rate result of Sa et al. (2015) to realize the proof; from the algorithmic perspective, BPCA needs more algorithmic study on deciding the block size without depending on the pre-specified error goal; from the practical perspective, it is not clear which family should be preferred in real-world applications. This paper makes contributions on all three perspectives. We first prove the convergence of SPCA for $k > 1$ with a decaying learning rate scheme in Section 3. The convergence result turns out to be asymptotically similar to the result of Sa et al. (2015) while not relying on the fixed learning rate. Then in Section 4, we propose a dynamic block power method (DBPCA) that automatically decides the block size to not only allow easier algorithmic use but also guarantee a better convergence rate. Finally, we conduct experiments on real-world datasets and provide concrete recommendations in Section 5.


2 Preliminaries 


Let us first introduce some notation that will be used later. First, let $x \le O(y)$ and $x \ge \Omega(y)$ denote that for some universal constant $c$, independent of all our parameters, $x \le cy$ and $x \ge cy$, respectively, for a large enough $y$. Next, let $\lceil x \rceil$ denote the smallest integer that is at least $x$. Finally, for a vector $x$, we let $\|x\|$ denote its $\ell_2$-norm, and for a matrix $M$, we let $\|M\| = \max_{x \ne 0} \|Mx\| / \|x\|$, which is the spectral norm.

In this paper, we study the streaming PCA problem, in which each input data point $x_n \in \mathbb{R}^d$ is received at step $n$ within a stream. Following previous works (Mitliagkas et al., 2013; Balsubramani et al., 2013), we make the following assumption on the data distribution.

Assumption 1. Assume that each $x_n$ is sampled independently from some distribution $\mathcal{X}$ with mean zero and covariance matrix $A$, which has eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, with $\lambda_k > \lambda_{k+1}$. Moreover, assume that $\|x\| \le 1$ for any $x$ in the support of $\mathcal{X}$,¹ which implies that $\|A\| \le 1$ and $\|x_n x_n^\top\| \le 1$ for each $n$.
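
The two implications at the end of Assumption 1 follow in one line each. The derivation below is a short worked sketch that uses only the zero-mean assumption (so that $A = E[x x^\top]$) and the convexity of the spectral norm:

```latex
\|x_n x_n^\top\| = \|x_n\|^2 \le 1,
\qquad
\|A\| = \bigl\| \mathbb{E}[x x^\top] \bigr\|
      \le \mathbb{E}\,\|x x^\top\|
      = \mathbb{E}\,\|x\|^2
      \le 1 .
```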


Our goal is to find a $d \times k$ matrix $Q_n$ at each step $n$, with its column space quickly approaching that spanned by the first $k$ eigenvectors of $A$. For convenience, we let $\lambda = \lambda_k$ and $\hat\lambda = \lambda_{k+1}$, and moreover, let $U$ denote the $d \times k$ matrix with the first $k$ eigenvectors of $A$ as its columns. One common way to measure the distance between two such spaces is

$$\Phi_n = \max_{v} \left( 1 - \frac{\|U^\top Q_n v\|^2}{\|v\|^2} \right), \tag{1}$$

which can be used as an error measure for $Q_n$. It is known that $\Phi_n = \sin\theta_k(U, Q_n)^2$, where $\theta_k(U, Q_n)$ is the $k$-th
¹We can relax this condition to that of having a small $\|x\|$ with a high probability as Hardt and Price (2014) do, but we choose this stronger condition to simplify our presentation.

Algorithm 1 SPCA

1: $S_0 \sim \mathcal{N}(0, 1)^{d \times k}$
2: $S_0 = Q_0 R_0$ (QR decomposition)
3: $n \leftarrow 1$
4: while receiving data do
5:     $S_n \leftarrow Q_{n-1} + \gamma_n x_n x_n^\top Q_{n-1}$
6:     $S_n = Q_n R_n$ (QR decomposition)
7:     $n \leftarrow n + 1$
8: end while


principal angle between these two spaces. For simplicity, we will denote $\sin\theta_k(U, Q_n)$ by $\sin(U, Q_n)$. Moreover, let $\cos(U, Q_n) = \sqrt{1 - \sin(U, Q_n)^2}$ and $\tan(U, Q_n) = \sin(U, Q_n) / \cos(U, Q_n)$. It is also known that $\cos(U, Q_n)$ equals the smallest singular value of the matrix $U^\top Q_n$. More can be found in, e.g., Golub and Van Loan (1996).
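
These quantities are easy to compute when $U$ is available, for instance on synthetic data. The following is a minimal sketch, assuming both matrices have orthonormal columns; the function name `principal_angle_errors` and the small example are illustrative only.

```python
import numpy as np

def principal_angle_errors(U, Q):
    """Return (sin, cos, tan) of the largest principal angle between span(U) and span(Q).

    Both U and Q are assumed to be d x k matrices with orthonormal columns.
    """
    # cos(U, Q) is the smallest singular value of U^T Q.
    singular_values = np.linalg.svd(U.T @ Q, compute_uv=False)
    cos = float(np.clip(singular_values.min(), 0.0, 1.0))
    sin = float(np.sqrt(1.0 - cos ** 2))
    tan = sin / cos if cos > 0.0 else float("inf")
    return sin, cos, tan

# Example: rotate the second basis vector of U by an angle theta out of span(U).
theta = 0.3
U = np.eye(3)[:, :2]
Q = np.column_stack([[1.0, 0.0, 0.0], [0.0, np.cos(theta), np.sin(theta)]])
sin_uq, cos_uq, tan_uq = principal_angle_errors(U, Q)
print(sin_uq, cos_uq, tan_uq)  # sine, cosine and tangent of theta, up to rounding
```

The error $\Phi_n$ of (1) is then the square of the returned sine.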

Our algorithms will generate an initial matrix $S_0 \in \mathbb{R}^{d \times k}$ by sampling each of its entries independently from the normal distribution $\mathcal{N}(0, 1)$. Let $S_0 \sim \mathcal{N}(0, 1)^{d \times k}$ denote this process, and we will rely on the following guarantee.

Lemma 1. (Mitliagkas et al., 2013) Suppose we sample $S_0 \sim \mathcal{N}(0, 1)^{d \times k}$ and let $S_0 = Q_0 R_0$ be its QR decomposition. Then for a large enough constant $c$, there is a small enough constant $\delta_0$ such that

$$\Pr\left[ \cos(U, Q_0) \le \sqrt{c/(dk)} \right] \le \delta_0 .$$


3 Stochastic Gradient Descent 

In this section, we study the classic PCA algorithm framework of Oja and Karhunen (1985) for the general rank-$k$ case, which can be seen as performing stochastic gradient descent. Our algorithm, called SPCA, is given in Algorithm 1. The key component is to determine the learning rate, which is related to the error analysis. We choose the step size at step $n$ as

$$\gamma_n = \frac{c}{n}, \quad \text{with } c = \frac{c_0}{\lambda - \hat\lambda} \text{ for a constant } c_0 > 12 .$$

The algorithm has a space complexity of $O(kd)$, by noting that the computation of $x_n x_n^\top Q_{n-1}$ can be done by first computing $x_n^\top Q_{n-1}$ and then multiplying the result by $x_n$. The sample complexity of our algorithm is guaranteed by the following, which we prove in Subsection 3.1. Our analysis is inspired by and follows closely that of Balsubramani et al. (2013) for the rank-one case, but there are several new hurdles which we need to overcome in the general rank-$k$ case.
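
The update of Algorithm 1 and the $O(kd)$ memory argument can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `spca_stream`, the `seed` argument, and the offset `n0` (with the step size taken as $c/(n_0 + n)$, so that `n0 = 0` gives $c/n$) are assumptions made here for the example.

```python
import numpy as np

def spca_stream(stream, d, k, c, n0=0, seed=0):
    """Sketch of Algorithm 1 (SPCA) with the decaying step size c / (n0 + n).

    `stream` is any iterable of length-d NumPy vectors. Only the d x k
    matrices Q and S are kept in memory, i.e. O(kd) numbers.
    """
    rng = np.random.default_rng(seed)
    # Lines 1-2: random Gaussian start followed by a QR decomposition.
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for n, x in enumerate(stream, start=1):
        gamma = c / (n0 + n)
        # Line 5: compute x^T Q first (a length-k vector), then multiply by x,
        # so the d x d matrix x x^T is never formed.
        S = Q + gamma * np.outer(x, x @ Q)
        # Line 6: re-orthonormalize.
        Q, _ = np.linalg.qr(S)
    return Q
```

A typical call would be `Q = spca_stream(data_iterator, d, k, c)`, where `c` plays the role of $c_0 / (\lambda - \hat\lambda)$ and has to be chosen by the user, since the eigenvalues are unknown in practice.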

Theorem 1. For any $\rho \in (0, 1)$, there is some $N \le [\ldots] + O([\ldots])$ such that our algorithm with high probability can achieve $\Phi_n \le \rho$ for any $n \ge N$.


Let us remark that we did not attempt to optimize the first term in the bound above, as it is dominated by the second term for a small enough $\rho$. Note that Sa et al. (2015) provided a better bound, which only has quadratic dependence on the eigengap $\lambda - \hat\lambda$, for a similar algorithm called Alecton. Alecton is restricted to taking a fixed learning rate that comes from a pre-specified error goal on a fixed amount of to-be-received data points. The restriction makes Alecton less practical in the streaming setting, because one may not always be able to know the amount of to-be-received data points in advance. If one receives fewer points than needed, Alecton cannot achieve the error goal; if one receives more than needed, Alecton cannot fully exploit the additional points for a smaller error. The decaying learning rate used by our proposed SPCA algorithm, on the other hand, does not suffer from such a restriction.

3.1 Proof of Theorem 1 

The analysis of Balsubramani et al. (2013) works for the rank-one case by using a potential function $\Psi_n = 1 - (U^\top Q_n)^2$, where $U$ and $Q_n$ are both vectors instead of matrices. To work in the general rank-$k$ case, we choose the function $\Phi_n$ defined in (1) as a generalization of their $\Psi_n$, and our goal is to bound $\Phi_n$.

Following Balsubramani et al. (2013), we divide the steps into epochs, with epoch $i$ ranging from step $n_{i-1}$ to step $n_i - 1$, where we choose $n_0 = [\ldots] \log d$, for a large enough constant $c$, and $n_i = \lceil e^{5/c_0} \rceil (n_{i-1} + 1) - 1$ for $i \ge 1$.

Remark 1. This gives us $(n_i + 1) \ge e^{5/c_0} (n_{i-1} + 1)$ and $n_i \le c_1 n_{i-1}$ for some constant $c_1$.


As in Balsubramani et al. (2013), we also use the convention of starting from step $n_0$. For each epoch $i$, we would like to establish an upper bound $\rho_i$ on $\Phi_n$ for each step $n$ in that epoch. To start with, we know the following from Lemma 1, using the fact that $\Phi_0 = \sin(U, Q_0)^2 = 1 - \cos(U, Q_0)^2$.

Lemma 2. Let $\Gamma_0$ denote the event that $\Phi_0 \le \rho_0$, where $\rho_0 = 1 - c/(kd)$ for the constant $c$ in Lemma 1. Then we have $\Pr[\neg\Gamma_0] \le \delta_0$.

Next, for each epoch $i \ge 1$, we consider the event

$$\Gamma_i : \sup_{n_{i-1} \le n < n_i} \Phi_n \le \rho_i ,$$

for some $\rho_i$ to be specified later. Then our goal is to show that $\Pr[\neg\Gamma_{i+1} \mid \Gamma_i]$ is small, for $i \ge 0$. This can be done for the rank-one case, but it relies crucially on the property that the potential function of Balsubramani et al. (2013) satisfies a nice recurrence relation. Unfortunately, this does not appear so for our function $\Phi_n$, mainly because it takes an additional maximization over $v \in \mathbb{R}^k$. To overcome this problem, we take the following approach.

Consider an epoch $i$ and a step $n$ in the epoch. Let us define a new matrix $Y_n = (I + \gamma_n x_n x_n^\top) Y_{n-1} = Q_n R_n R_{n-1} \cdots R_{n_{i-1}+1}$, with $Y_{n_{i-1}} = Q_{n_{i-1}}$. Let $S = \{v \in \mathbb{R}^k : \|v\| = 1\}$. Then for any $v \in S$, define

$$\Phi_n^{(v)} = 1 - \frac{\|U^\top Y_n v\|^2}{\|Y_n v\|^2} ,$$

and note that $\Phi_n = \max_{v \in S} \Phi_n^{(v)}$. Now for each such new function $\Phi_n^{(v)}$ with a fixed $v$, we can establish a similar recurrence relation as follows, but for our purpose later we show a better upper bound on $|Z_n|$ than that in Balsubramani et al. (2013). We give the proof in Appendix A.

Lemma 3. For any $n > n_0$ and any $v \in S$, we have $\Phi_n^{(v)} \le \Phi_{n-1}^{(v)} + \beta_n - Z_n$, where

1. $\beta_n = 5\gamma_n^2 + 2\gamma_n^3$,

2. $|Z_n| \le 2\gamma_n \sqrt{\Phi_{n-1}^{(v)}}$,

3. $E[Z_n \mid \mathcal{F}_{n-1}] \ge 2\gamma_n (\lambda - \hat\lambda) \Phi_{n-1}^{(v)} (1 - \Phi_{n-1}^{(v)}) \ge 0$.²

With this lemma, the analysis of Balsubramani et al. (2013) can be used to show that $E[\Phi_n^{(v)}]$ decreases as $n$ grows, but only for each individual $v$ separately. This alone is not sufficient to guarantee the event $\Gamma_{i+1}$ as it requires small $\Phi_n^{(v)}$ for all $v$'s simultaneously. To deal with this, a natural approach is to show that each $\Phi_n^{(v)}$ is large with a small probability, and then apply a union bound, but an apparent difficulty is that there are infinitely many $v$'s. We will overcome this difficulty by showing how it is possible to apply a union bound only over a finite "$\epsilon$-net" for these infinitely many $v$'s. Still, for this approach to work, we need the probability of having a large $\Phi_n^{(v)}$ to be small enough, compared to the size of the $\epsilon$-net. However, the beginning steps of the first epoch seem to have us in trouble already, as the probability of their $\Phi_n^{(v)}$ values exceeding $\rho_0$ is not small. This seems to prevent us from having an error bound $\rho_1 \le \rho_0$, and without this to start, it is not clear if we could have smaller and smaller error bounds for later epochs. To handle this, we sacrifice the first epoch by using an error bound $\rho_1$ slightly larger than $\rho_0$, but still small enough. The hope is that once $\Gamma_1$ is established, we then have a period of small errors, and later epochs could then start to have decreasing $\rho_i$'s. More precisely, we have the following for the first epoch, which we prove in Appendix B.

Lemma 4. Let $\rho_1 = 1 - c/(c_1^{[\ldots]} kd)$, for the constant $c_1$ given in Remark 1. Then $\Pr[\neg\Gamma_1 \mid \Gamma_0] = 0$.

It remains to set the error bounds for later epochs appropriately so that we can actually have small $\Pr[\neg\Gamma_{i+1} \mid \Gamma_i]$, for $i \ge 1$. We let the error bounds decrease in three phases as follows. In the first phase, we let $\rho_i = 1 - 2(1 - \rho_{i-1})$, so that $\eta_i = 1 - \rho_i$ doubles each time. It ends at the first epoch $i$, denoted by $\pi_1$, such that $\rho_i \le 3/4$. Note that $\pi_1 \le O(\log d)$ and at this point, $\rho_{\pi_1}$ is still much larger than $1/n_{\pi_1}$. Then in the second phase, we let $\rho_i = \rho_{i-1} / \lceil e^{5/c_0} \rceil^2$, which decreases at a faster rate than $n_i$ increases. It ends at the first epoch $i$, denoted by $\pi_2$, such that $\rho_i \le c_2 (c^{[\ldots]} k \log n_{i-1}) / (n_{i-1} + 1)$, for some constant $c_2$.³ Note that $\pi_2 \le O(\log d)$ and at this point, $\rho_{\pi_2}$ reaches about the order of $1/n_{\pi_2}$. Finally in phase three, we let $\rho_i = c_2 (c^{[\ldots]} k \log n_{i-1}) / (n_{i-1} + 1)$, which decreases at about the same rate as $n_i$ increases.

²As in Balsubramani et al. (2013), $\mathcal{F}_{n-1}$ here denotes the $\sigma$-field of all outcomes up to and including step $n - 1$.

With these choices, the events $\Gamma_i$'s are now defined, and our key lemma is the following, which we prove in Appendix C. The proof handles the difficulties above by showing how a union bound can be applied only on a small "$\epsilon$-net" of $S$ along with proper choices of $\rho_i$ to guarantee that each $\Phi_n^{(v)}$ is large with a small enough probability.

Lemma 5. For any $i \ge 1$, $\Pr[\neg\Gamma_{i+1} \mid \Gamma_i] \le \frac{\delta_0}{2(i+1)^2}$.

From these lemmas, we can bound the failure probability of our algorithm as

$$\Pr[\exists i \ge 0 : \neg\Gamma_i] \le \Pr[\neg\Gamma_0] + \sum_{i \ge 0} \Pr[\neg\Gamma_{i+1} \mid \Gamma_i] \le \delta_0 + \sum_{i \ge 1} \frac{\delta_0}{2(i+1)^2} ,$$

which is at most $2\delta_0$ using the fact that $\sum_{i \ge 1} 1/i^2 \le 2$.
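
For completeness, the arithmetic behind the last step can be spelled out. The sketch below reads the bound of Lemma 5 as $\delta_0 / (2(i+1)^2)$ and uses Lemma 4 to drop the $i = 0$ term of the sum; the value $\pi^2/6$ of the series is a standard fact.

```latex
\Pr[\exists i \ge 0 : \neg\Gamma_i]
  \le \delta_0 + \sum_{i \ge 1} \frac{\delta_0}{2(i+1)^2}
  \le \delta_0 + \frac{\delta_0}{2} \sum_{i \ge 1} \frac{1}{i^2}
  = \delta_0 + \frac{\delta_0}{2} \cdot \frac{\pi^2}{6}
  \le \delta_0 + \frac{\delta_0}{2} \cdot 2
  = 2\delta_0 .
```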

To complete the proof, it remains to determine the number of samples needed by our algorithm to achieve an error bound $\rho$. This amounts to determining the number $n_i$ of an epoch $i$ with $\rho_i \le \rho$. With $n_{\pi_1} \ge n_0$, it is not hard to check that $\rho_{\pi_2} \le [\ldots]$ and $n_{\pi_2} \le [\ldots]$. Then if $\rho \ge \rho_{\pi_2}$, we can certainly use $n_{\pi_2}$ as an upper bound. If $\rho < \rho_{\pi_2}$, it is not hard to check that with $n_i \le O(c^{[\ldots]} k (1/\rho) \log(1/\rho))$, we can have $\rho_i \le \rho$. As $c = c_0 / (\lambda - \hat\lambda)$, this proves Theorem 1.

4 Block-Wise Power Method 

In this section, we turn to study a different approach based on block-wise power methods. Our algorithm is modified from that of Mitliagkas et al. (2013) (referred to as BPCA), which updates the estimate $Q_n$ with a more accurate estimate of $A$ using a block of samples, instead of one single sample as in our first algorithm. Our algorithm differs from BPCA by allowing different block sizes, instead of a fixed size. More precisely, we divide the steps into blocks, with block $i$ consisting of steps from some interval $I_i$, and we use this block of $|I_i|$ samples to update our estimate from $Q_{i-1}$ to $Q_i$. We will specify $|I_i|$ later in (3), which basically grows exponentially after some initial blocks. We call our algorithm DBPCA, as described in Algorithm 2.

³Determined later in the proof of Lemma 5 in Appendix C for

Algorithm 2 DBPCA

1: $S_0 \sim \mathcal{N}(0, 1)^{d \times k}$
2: $S_0 = Q_0 R_0$ (QR decomposition)
3: $i \leftarrow 1$
4: while receiving data do
5:     $S_i \leftarrow 0$
6:     for $n \in I_i$ do
7:         $S_i \leftarrow S_i + \frac{1}{|I_i|} x_n x_n^\top Q_{i-1}$
8:     end for
9:     $S_i = Q_i R_i$ (QR decomposition)
10:    $i \leftarrow i + 1$
11: end while


This algorithm, like our first algorithm SPCA, also has a space complexity of $O(kd)$. The sample complexity is guaranteed by the following, which we will prove in Subsection 4.1. To have an easier comparison with the results of Mitliagkas et al. (2013) and Hardt and Price (2014), we use $\sin(U, Q_n)$ as the error measure in this section.
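
Algorithm 2 can be sketched in NumPy as follows. This is an illustrative sketch, not the exact implementation behind the experiments: the names `dbpca_stream` and `geometric_block_sizes` are hypothetical, and the schedule, which starts at $2k$ samples and enlarges each block by a factor $1/\gamma^2$, is an assumption chosen here to mimic the roughly exponential growth of $|I_i|$; the theoretical block sizes are those specified in (3).

```python
import math
from itertools import islice

import numpy as np

def geometric_block_sizes(k, gamma, num_blocks):
    """Assumed schedule: start at 2k samples and grow by 1 / gamma^2 per block."""
    return [math.ceil(2 * k / gamma ** (2 * i)) for i in range(num_blocks)]

def dbpca_stream(stream, d, k, block_sizes, seed=0):
    """Sketch of Algorithm 2 (DBPCA) for a given sequence of block sizes.

    `stream` is any iterable of length-d NumPy vectors. Only the d x k
    matrices Q and S are kept in memory; no block of samples is stored.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # lines 1-2
    it = iter(stream)
    for size in block_sizes:
        S = np.zeros((d, k))                          # line 5
        seen = 0
        for x in islice(it, size):                    # lines 6-8
            S += np.outer(x, x @ Q) / size
            seen += 1
        if seen < size:
            break  # stream ended inside a block: keep the last complete estimate
        Q, _ = np.linalg.qr(S)                        # line 9
    return Q
```

Passing a constant list of block sizes to `dbpca_stream` gives a fixed-block-size variant in the spirit of BPCA, which makes the two easy to compare on the same stream.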

Theorem 2. Given any $\varepsilon \le 1/\sqrt{kd}$, our algorithm can achieve an error $\varepsilon$ with high probability after $L$ iterations with a total of $N$ samples, for some $L \le [\ldots]$ and $N \le [\ldots]$.

Let us make some remarks about the theorem. First, the error $\rho$ in Theorem 1 corresponds to the error $\varepsilon^2$ here, and one can see that the bound in Theorem 2 is better than those in Theorem 1 and Mitliagkas et al. (2013); Hardt and Price (2014) in general. We summarize the sample complexity in terms of the error $\varepsilon$ in Table 1. Next, the condition $\varepsilon \le 1/\sqrt{kd}$ in the theorem is only used to simplify the error bound. One can check that our analysis also works for any $\varepsilon \le 1$, but the resulting bound for $N$ has the factor $\varepsilon^2$ replaced by $\min(1/(kd), \varepsilon^2)$. Finally, from Theorem 2, one can also express the error in terms of the number of samples $n$ as $\varepsilon(n) \le O([\ldots])$.

4.1 Proof of Theorem 2 

Recall that after the $i$-th block, we have the estimate $Q_i$, and we would like it to be close to $U$, with a small error $\sin(U, Q_i)$. To bound this error, we follow Hardt and Price (2014) and work on bounding a surrogate error $\tan(U, Q_i)$, which suffices as $\sin(U, Q_i) \le \tan(U, Q_i)$.

To start with, we know from Lemma 1 that for $\varepsilon_0 = \sqrt{c/(kd)}$ with some constant $c$, $\Pr[\tan(U, Q_0) \ge \varepsilon_0] \le \delta_0$, using the fact that $\tan(U, Q_0)^2 = 1/\cos(U, Q_0)^2 - 1$.

Next, we would like to bound each $\tan(U, Q_i)$ in terms of the previous $\tan(U, Q_{i-1})$. For this, recall that with $F_i = \sum_{n \in I_i} x_n x_n^\top / |I_i|$, we have $Q_i R_i = F_i Q_{i-1}$, which


the bound (5) there to hold. 


Algorithm                      Complexity                            Restriction

SGD family
  Balsubramani et al. (2013)   [...]                                 only for k = 1
  Sa et al. (2015)             O([...] log([...]))                   pre-specified ε
  our proposed SPCA            [...]                                 none

block power method family
  Hardt and Price (2014)       [...]                                 pre-specified ε
  our proposed DBPCA           O([...] log(log(1/ε)) [...])          none

Table 1: Sample complexity and restriction


can be rewritten as $A Q_{i-1} + (F_i - A) Q_{i-1}$. Using the notation $G_i = (F_i - A) Q_{i-1}$, we have $Q_i R_i = A Q_{i-1} + G_i$, where $G_i$ can be seen as the noise arising from estimating $A$ by $F_i$ using the $i$-th block of samples. Then, we rely on the following lemma from Hardt and Price (2014), with the parameters:

$$\bar\lambda = \max(\hat\lambda, \lambda/4), \quad \gamma = (\bar\lambda/\lambda)^{1/4} \quad \text{and} \quad \Delta = (\lambda - \bar\lambda)/4 . \tag{2}$$

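Lemma 6. (Hardt and Price, 2014) Suppose $\|G\| \le \Delta \cdot \min(\cos(U, Q), \beta)$, for some $\beta > 0$. Then

$$\tan(U, AQ + G) \le \max(\beta, \max(\beta, \gamma) \tan(U, Q)) .$$

From this, we can have the following lemma, proved in Appendix D, which provides an exponentially-decreasing upper bound on $\tan(U, Q_i)$, for the parameters:

$$\varepsilon_i = \varepsilon_0 \gamma^i \quad \text{and} \quad \beta_i = \min\left( \gamma / \sqrt{1 + \varepsilon_{i-1}^2}, \; \gamma \varepsilon_{i-1} \right),$$

where $\varepsilon_0 = \sqrt{c/(dk)}$ with the constant $c$ in Lemma 1.

Lemma 7. Suppose $\tan(U, Q_{i-1}) \le \varepsilon_{i-1}$ and $\|G_i\| \le \Delta \beta_i$. Then $\tan(U, Q_i) \le \varepsilon_i$.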

The key which sets our approach apart from that of Mitliagkas et al. (2013); Hardt and Price (2014) is the following observation. According to Lemma 7, for earlier iterations, one can in fact tolerate a larger $\|G_i\|$ and thus a larger empirical error for estimating $A$. This allows us to have smaller blocks at the beginning to save the number of samples, while still keeping the failure probability low. More precisely, we have the following lemma, proved in Appendix E, with the parameters:

$$\delta_i = \delta_0 / (2 i^2) \quad \text{and} \quad |I_i| = (c / (\Delta \beta_i)^2) \log(d / \delta_i), \tag{3}$$

where $\delta_0$ is the error probability given in Lemma 1 and $c$ is a large enough constant.

Lemma 8. For any $i \ge 1$, given $|I_i|$ samples in iteration $i$, we have $\Pr[\|G_i\| > \Delta \beta_i] \le \delta_i$.

With this, we can bound the failure probability of our algorithm as

$$\Pr[\exists i \ge 0 : \tan(U, Q_i) > \varepsilon_i] \le \Pr[\tan(U, Q_0) > \varepsilon_0] + \sum_{i \ge 1} \Pr[\|G_i\| > \Delta \beta_i] ,$$

which by Lemma 1 and Lemma 8 is at most $\delta_0 + \sum_{i \ge 1} \delta_i = \delta_0 + \sum_{i \ge 1} \delta_0 / (2 i^2) \le 2\delta_0$.

To complete the proof of Theorem 2, it remains to bound the number of samples needed for achieving error $\varepsilon$. For this, we rely on the following lemma which we prove in Appendix F.

Lemma 9. For some $L \le O([\ldots] \log [\ldots])$, we have $\varepsilon_L \le \varepsilon$.

Finally, as $\bar\lambda = \max(\hat\lambda, \lambda/4)$, we have $\lambda - \bar\lambda \ge \Omega(\lambda - \hat\lambda)$, and putting this into the bound above yields the sample complexity bound stated in the theorem.
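
The inequality between the two gaps can be verified by a short case analysis. The sketch below writes $\bar\lambda$ for the maximum of $\hat\lambda$ and $\lambda/4$, as in (2), and uses only the fact that the eigenvalues of a covariance matrix are non-negative:

```latex
\text{If } \hat\lambda \ge \lambda/4: \quad
  \bar\lambda = \hat\lambda
  \;\Rightarrow\;
  \lambda - \bar\lambda = \lambda - \hat\lambda .
\qquad
\text{If } \hat\lambda < \lambda/4: \quad
  \bar\lambda = \lambda/4
  \;\Rightarrow\;
  \lambda - \bar\lambda = \tfrac{3}{4}\lambda \ge \tfrac{3}{4}(\lambda - \hat\lambda) .
```

In both cases $\lambda - \bar\lambda \ge \tfrac{3}{4} (\lambda - \hat\lambda)$.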

5 Experiment 

We conduct experiments on two large real-world datasets, NYTimes and PubMed (Bache and Lichman, 2013), as used by Mitliagkas et al. (2013). The dimension $d$ of the data points is 102 thousand for NYTimes and 141 thousand for PubMed, which matches our memory-restricted setting. The features of both datasets are normalized into $[0, 1]$.

Parameter tuning is generally difficult for streaming algo¬ 
rithms. Instead of tuning the parameters extensively and 
reporting with the most optimistic (but perhaps unrealistic) 
parameter choice for each algorithm, we consider a thor¬ 
ough range of parameters but report the results of four pa¬ 
rameter choices per algorithm, which cover the best param¬ 
eter choice, to understand each algorithm more deeply. 

We compare the proposed SPCA and DBPCA with Alecton (fixed-learning-rate; Sa et al., 2015) and BPCA (fixed-block-size; Hardt and Price, 2014). For Alecton, we report the results of the learning rate $\gamma \in \{10^{-1}, \ldots, 10^{2}\}$, with reasons to be explained in Subsection 5.1. For SPCA, we follow its existing work (Balsubramani et al., 2013) to fix $n_0 = 0$ while considering $c \in \{10^{[\ldots]}, \ldots, 10^{[\ldots]}\}$. Then we report the results of $c \in \{10^{[\ldots]}, 10^{[\ldots]}, 10^{[\ldots]}, 10^{[\ldots]}\}$. For BPCA, we follow its existing works (Mitliagkas et al., 2013; Hardt and Price, 2014) and let the block size be $\lfloor N/T \rfloor$, where $N$ is the size of the dataset and $T$ is the number of blocks. Theoretical results of BPCA (Hardt and Price, 2014) suggest $T = O([\ldots] \log d)$. Because $\lambda$ and $\hat\lambda$ are unknown in practice, Mitliagkas et al. (2013); Hardt and Price (2014) set $T = \lfloor L \log d \rfloor$ with $L = 1$. Instead, we extend the range to $L \in \{5^{-[\ldots]}, \ldots, 5^{[\ldots]}\}$ and report the results of $L \in \{5^0, 5^1, 5^2, 5^3\}$.
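
The BPCA parameterization above translates into a block size as follows. This is a small sketch under stated assumptions: the helper name `bpca_block_size` is hypothetical, the logarithm is taken to be the natural logarithm, and $T$ is clamped to at least one block so that very small $L$ does not lead to a division by zero. The dataset size used in the example is a made-up round number.

```python
import math

def bpca_block_size(num_points, d, L=1.0):
    """Return (block_size, T) with T = floor(L * log d) and block_size = floor(N / T).

    Assumptions: natural logarithm, and T is clamped to at least 1.
    """
    T = max(1, math.floor(L * math.log(d)))
    return num_points // T, T

# Hypothetical stream of 300,000 points in dimension 102,660.
for L in (1, 5, 25, 125):
    block_size, T = bpca_block_size(300_000, 102_660, L)
    print(f"L = {L:>3}: T = {T:>4} blocks of {block_size} points")
```

Larger $L$ means more and smaller blocks, which is the trade-off examined in Subsection 5.2.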



(a) Alecton, $k = 4$   (b) SPCA, $k = 4$   (c) BPCA, $k = 4$   (d) DBPCA, $k = 4$

Figure 1: Performance of different algorithms on NYTimes when $k = 4$

Algorithm   T = 10^5         T = 2 x 10^5
Alecton     0.232 ± 0.028    0.148 ± 0.008
SPCA        0.159 ± 0.008    0.079 ± 0.006
BPCA        0.234 ± 0.021    0.177 ± 0.012
DBPCA       0.138 ± 0.008    0.064 ± 0.005

Table 2: Performance of different algorithms with the best parameter on NYTimes when $k = 4$


For the proposed DBPCA, we set the initial block size as $2k$ to avoid being rank-insufficient in the first block. Then, we consider the ratio $\gamma \in \{0.6, 0.7, 0.8, 0.9\}$⁴ for enlarging the block size.

We run each algorithm 60 times by randomly generating data streams from the dataset. We consider $\sin(U, Q_n)^2$, which is the error function used for the convergence analysis, as the performance evaluation criterion. The average performance on the two datasets for $k = 4$ and $k = 10$ is shown in Figure 1, Figure 2, Figure 3 and Figure 4, respectively. Our experiments on other $k$ values lead to similar observations and are not included here because of the space limit. Also, we report the mean and the standard error of each algorithm with the best parameters in Tables 2, 3, 4 and 5. To visualize the results clearly, we crop the figures up to $n = 200{,}000$, which is sufficient for checking the convergence of most of the parameter choices on the algorithms.

5.1 Comparison between SPCA and Alecton

The main difference between SPCA and Alecton is the rule of determining the learning rate. The learning rate of SPCA will decay along with the number of iterations, which means it could achieve arbitrarily small error when we have more data. On the other hand, Alecton needs to pre-specify the desired error to determine a fixed learning rate. To achieve the same error, from Table 1, SPCA and Alecton have the same asymptotic convergence rate theoretically. Next, we aim to study their empirical performance.

⁴Note that (2) suggests $\gamma > 0.5$.

(a) Alecton, $k = 10$   (b) SPCA, $k = 10$   (c) BPCA, $k = 10$   (d) DBPCA, $k = 10$

Figure 2: Performance of different algorithms on NYTimes when $k = 10$

Algorithm   T = 10^5         T = 2 x 10^5
Alecton     0.385 ± 0.013    0.386 ± 0.012
SPCA        0.170 ± 0.023    0.102 ± 0.018
BPCA        0.487 ± 0.042    0.317 ± 0.034
DBPCA       0.207 ± 0.028    0.151 ± 0.022

Table 3: Performance of different algorithms with the best parameter on NYTimes when $k = 10$

Sa et al. (2015) use a conservative rule to determine the learning rate. The upper bound of the learning rate $\gamma$ suggested in Sa et al. (2015) is smaller than $10^{-[\ldots]}$ for both datasets. However, this conservative and fixed learning rate scheme takes millions of iterations to reach performance competitive with SPCA. Similar results can also be found in Sa et al. (2015).

Although the suggested learning rate should be small, we still study the performance of Alecton with larger learning rates, which are from $10^{[\ldots]}$ to $10^{[\ldots]}$. We report the results of $\{10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, which contain the optimal choices for the datasets used. SPCA is generally better than Alecton, as in the case of Figure 1. From Tables 2, 3, 4 and 5, SPCA outperforms Alecton with the best parameters, which demonstrates the advantage of the decayed learning rate used by SPCA. From all figures, although Alecton with a larger learning rate ($\gamma = 10$) has a faster convergence behaviour at the beginning, it is stuck at a suboptimal point and cannot utilize the new incoming data. A smaller learning rate usually results in better performance in the end, but it takes more iterations than SPCA needs.

(a) Alecton, $k = 4$   (b) SPCA, $k = 4$   (c) BPCA, $k = 4$   (d) DBPCA, $k = 4$   (horizontal axis: number of data)

Figure 3: Performance of different algorithms on PubMed when $k = 4$

Algorithm   T = 10^5         T = 2 x 10^5
Alecton     0.051 ± 0.007    0.042 ± 0.000
SPCA        0.033 ± 0.000    0.022 ± 0.000
BPCA        0.045 ± 0.001    0.044 ± 0.000
DBPCA       0.026 ± 0.000    0.013 ± 0.000

Table 4: Performance of different algorithms with the best parameter on PubMed when $k = 4$

5.2 Comparison between DBPCA and BPCA 

From Figure 1 and Figure 2, DBPCA outperforms BPCA under most parameter choices when $k = 4$, and is competitive with BPCA when $k = 10$. The edge of DBPCA over BPCA is even more remarkable in Figure 3 and Figure 4. From the result of the best parameters, DBPCA is significantly better than BPCA by a $t$-test at 95% confidence.

BPCA has a similar drawback to Alecton. As can be observed from Figures 1, 2, 3 and 4, if $L$ is too small (larger blocks), BPCA only sees one or two blocks of data within $n = 200{,}000$, and cannot reduce the error much. BPCA typically needs $L > 1$ (smaller blocks) to achieve lower error in the end. $L = 125$ gives the best performance of BPCA in Figures 3 and 4. However, although a large $L$ (small blocks) in BPCA sometimes allows reducing the error in the beginning, the error cannot converge to a competitive level in the long run. For instance, in Figure 2(c), $L = 125$ converges fast but cannot improve much after $n = 50{,}000$; $L = 25$ converges slower but keeps going towards the lowest error after $n = 200{,}000$. Also, using smaller blocks cannot ensure reducing the error after each update, and hence BPCA with larger $L$ results in less stable curves even after averaging over 60 runs. The results show the difficulty of setting the parameters of BPCA by the strategy proposed in Mitliagkas et al. (2013); Hardt and Price (2014).

(a) Alecton, $k = 10$   (b) SPCA, $k = 10$   (c) BPCA, $k = 10$   (d) DBPCA, $k = 10$

Figure 4: Performance of different algorithms on PubMed when $k = 10$

Algorithm   T = 10^5         T = 2 x 10^5
Alecton     0.291 ± 0.007    0.292 ± 0.009
SPCA        0.274 ± 0.007    0.190 ± 0.040
BPCA        0.415 ± 0.037    0.203 ± 0.030
DBPCA       0.212 ± 0.024    0.141 ± 0.031

Table 5: Performance of different algorithms with the best parameter on PubMed when $k = 10$

On the other hand, DBPCA achieves better results by using
a smaller block in the beginning to make improvements
and a larger block later to further reduce the error. Also, in
both datasets and under all parameter choices, DBPCA stably
reduces the error after each update, which matches our
theoretical analysis that guarantees error reduction with
high probability. In addition, DBPCA is quite stable with
respect to the choice of $\gamma^2$ across the two datasets,
making it easier to tune in practice. These properties make
DBPCA favorable over BPCA in the family of block power
methods.
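The idea of starting with small blocks and enlarging them can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's DBPCA algorithm: the function names `growing_blocks` and `block_power_pca`, the starting size `first`, and the growth parameter `gamma_sq` are hypothetical, and the geometric schedule (each block larger than the previous one by a factor of `1 / gamma_sq`) merely stands in for the block-size rule (3), whose constants are not reproduced here.

```python
from itertools import islice

import numpy as np


def growing_blocks(n_total, first, gamma_sq):
    """Hypothetical schedule: block sizes grow by a factor 1/gamma_sq
    until n_total samples are used up (the last block is truncated)."""
    sizes, size, used = [], float(first), 0
    while used < n_total:
        b = min(int(round(size)), n_total - used)
        sizes.append(b)
        used += b
        size /= gamma_sq
    return sizes


def block_power_pca(stream, d, k, block_sizes, seed=0):
    """Block power method over a stream, one power step per block.

    Only the d x k matrices Q and S are kept, so memory stays O(kd).
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    it = iter(stream)
    for size in block_sizes:
        S = np.zeros((d, k))
        m = 0
        for x in islice(it, size):
            S += np.outer(x, x @ Q)  # accumulates (sum of x x^T) Q
            m += 1
        if m == 0:  # stream exhausted
            break
        Q, _ = np.linalg.qr(S / m)  # Q is updated only at the end of a block
    return Q
```

With `gamma_sq = 0.5`, `growing_blocks(1000, 100, 0.5)` yields blocks of 100, 200, 400 and a final truncated block of 300 points. Passing a constant list of sizes instead gives a fixed-block variant in the spirit of BPCA.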


5.3 Comparison between DBPCA and SPCA 

As observed, DBPCA is less sensitive to the parameter
$\gamma$, which corresponds to the theoretical suggestion of
$\max(\hat{\lambda}/\lambda, 1/4)^{1/4}$. In contrast, SPCA is
rather sensitive to the parameter c, which also corresponds to
a theoretical suggestion. For instance, setting c = 10^ results
in strong performance when k = 4 in Figure 1(b), but the worst
performance when k = 10 in Figure 2(b). Similar results can
be observed in Figure 3(b) and Figure 4(b) when c = 10^.
Furthermore, the parameter c in SPCA directly affects the
step size of each gradient descent update. Thus, compared
with the best parameter choice, a larger c leads to a less
stable performance curve, while a smaller c sometimes results
in significantly slower convergence. The results suggest
that SPCA needs more careful tuning and/or deeper study of
the proper parameter ranges.
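To make the role of c concrete, the following is a minimal sketch of an Oja-style update with the decayed step size $\gamma_n = c/n$, using the recurrence $Y_n = Y_{n-1} + \gamma_n x_n x_n^\top Y_{n-1}$ that appears in Appendix B.1. It is an illustration only: the function names `spca_stream` and `subspace_error`, the per-step QR re-orthonormalization, and the random initialization are assumptions, and the starting index and normalization schedule used in the paper's experiments may differ.

```python
import numpy as np


def spca_stream(stream, d, k, c, seed=0):
    """Oja-style streaming PCA with decayed step size gamma_n = c / n."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    for n, x in enumerate(stream, start=1):
        gamma = c / n  # c directly scales every update
        # Rank-one update Q + gamma * x x^T Q, using only O(kd) memory.
        Q = Q + gamma * np.outer(x, x @ Q)
        # Re-orthonormalize so that Q stays a basis of the current estimate.
        Q, _ = np.linalg.qr(Q)
    return Q


def subspace_error(U, Q):
    """Squared spectral norm of the part of Q outside span(U);
    U is assumed to have orthonormal columns."""
    return np.linalg.norm(Q - U @ (U.T @ Q), 2) ** 2
```

Because every step is scaled by c, a larger c makes each new point move the estimate further (noisier curves), while a smaller c shrinks all updates (slower progress), which is the trade-off described above.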

From Tables 2, 3, 4 and 5, DBPCA significantly outperforms
SPCA in 6 out of 8 cases by a t-test at 95% confidence.
The result supports the theoretical study showing that
DBPCA has a better convergence rate guarantee than SPCA.
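As a rough illustration of such a significance check, the sketch below computes a two-sample (Welch) t statistic from summary statistics. It rests on assumptions not stated in the tables: that each "±" entry is a sample standard deviation, and that each entry is computed over 60 independent runs (the run count mentioned for the averaged curves in Section 5.2). The helper name `welch_t` is hypothetical, and the exact test used in the paper may differ.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, runs):
    """Welch t statistic for two groups with `runs` samples each."""
    standard_error = math.sqrt(std_a ** 2 / runs + std_b ** 2 / runs)
    return (mean_a - mean_b) / standard_error


# Second column of Table 5: SPCA 0.190 ± 0.040 vs. DBPCA 0.141 ± 0.031.
t_stat = welch_t(0.190, 0.040, 0.141, 0.031, runs=60)

# About 1.98 is the two-sided 95% critical value for roughly 100 or more
# degrees of freedom; a statistic above it indicates a significant gap.
significant = t_stat > 1.98
```

Under these assumptions the statistic is far above the critical value, in line with the claim that DBPCA is significantly better in that setting.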

However, the benefit of SPCA is its immediate use of each new
data point. DBPCA, as a representative of the block-power-
method family, cannot update the solution until the end of
the growing block. The later points in the larger blocks may
then be effectively unused for a long period of time.
For instance, in Figure 2, DBPCA uses larger blocks than
necessary. After n = 150,000, the block size is close to
20,000, which is less efficient.

6 Conclusion 

We strengthen two families of streaming PCA algorithms, 
and compare the two strengthened families fairly from both 
theoretical and empirical sides. For the SGD family, we an¬ 
alyze the convergence rate of the famous SPCA algorithm 
for the multiple-principal-component cases without speci¬ 
fying the error in advance; for the family of block power 
methods, we propose a dynamic-block algorithm DBPCA 
that enjoys faster convergence rate than the original BPCA. 
Then, the empirical studies demonstrate that DBPCA not 
only outperforms BPCA often by dynamically enlarging 
the block sizes, but also converges to competitive results 
more stably than SPCA in many cases. Both the theoretical 
and empirical studies thus justify that DBPCA is the best 
among the two families, with the caveat of stalling the use 
of data points in larger blocks. 

Our work opens some new research directions. Empirical
results seem to suggest that SPCA is competitive with or
slightly worse than DBPCA. It is worth studying whether this
results from the substantial difference between
$\log(1/\epsilon)$ and $\log\log(1/\epsilon)$ or is caused by
the hidden constants in the bounds. One conjecture is
therefore that the bound in Theorem 1 can be further improved.
On the other hand, although (2) suggests $\gamma^2 \ge 0.5$,
the empirical results show that a larger $\gamma^2$ generally
results in better performance. Hence, it is also worth
studying whether the lower bound could be further improved.

References 

Arora, R., Cotter, A., Livescu, K., and Srebro, N. (2012).
Stochastic optimization for PCA and PLS. In Annual Allerton
Conference on Communication, Control, and Computing.

Arora, R., Cotter, A., and Srebro, N. (2013). Stochastic
optimization of PCA with capped MSG. In Advances in
Neural Information Processing Systems 26.

Bache, K. and Lichman, M. (2013). UCI machine learning
repository.

Balsubramani, A., Dasgupta, S., and Freund, Y. (2013).
The fast convergence of incremental PCA. In Advances
in Neural Information Processing Systems.

Golub, G. H. and Van Loan, C. F. (1996). Matrix
Computations (3rd Ed.). Johns Hopkins University Press.

Hardt, M. and Price, E. (2014). The noisy power method: A
meta algorithm with applications. In Advances in Neural
Information Processing Systems 27.

Karnin, Z. and Liberty, E. (2015). Online PCA with spectral
bounds. In Conference on Learning Theory.

Milman, V. D. and Schechtman, G. (1986). Asymptotic theory
of finite-dimensional normed spaces. Lecture Notes
in Mathematics. Springer.

Mitliagkas, I., Caramanis, C., and Jain, P. (2013). Memory
limited, streaming PCA. In Advances in Neural Information
Processing Systems.

Nie, J., Kotlowski, W., and Warmuth, M. K. (2013). Online
PCA with optimal regrets. In International Conference on
Algorithmic Learning Theory.

Oja, E. and Karhunen, J. (1985). On stochastic approximation
of the eigenvectors and eigenvalues of the expectation of a
random matrix. Journal of Mathematical Analysis and
Applications.

Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making
gradient descent optimal for strongly convex stochastic
optimization. In International Conference on Machine
Learning.

Sa, C. D., Re, C., and Olukotun, K. (2015). Global
convergence of stochastic gradient descent for some
non-convex matrix problems. In International Conference on
Machine Learning.

Warmuth, M. K. and Kuzmin, D. (2008). Randomized Online
PCA Algorithms with Regret Bounds that are Logarithmic
in the Dimension. Journal of Machine Learning Research.


A Proof of Lemma 3 

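Using the notation $\hat{v} = Y_{n-1}v/\|Y_{n-1}v\|$ and
$A_n = x_n x_n^\top$, one can follow the analysis in
Balsubramani et al. (2013) to show that
$\Phi_n^{(v)} \le \Phi_{n-1}^{(v)} + \beta_n - Z_n$, with

• $\beta_n = 5\gamma_n^2 + 2\gamma_n^3$,

• $Z_n = 2\gamma_n(\hat{v}^\top U U^\top A_n \hat{v} - \|U^\top \hat{v}\|^2 \hat{v}^\top A_n \hat{v})$, and

• $E[Z_n \mid \mathcal{F}_{n-1}] \ge 2\gamma_n(\lambda - \hat{\lambda})\Phi_{n-1}^{(v)}(1 - \Phi_{n-1}^{(v)}) \ge 0$.

We omit the proof here as the adaptation is straightforward.
It remains to show our better bound on $|Z_n|$. For this, note
that

$$|Z_n| \le 2\gamma_n \left\|\hat{v}^\top U U^\top - \|U^\top \hat{v}\|^2 \hat{v}^\top\right\| \cdot \|A_n \hat{v}\|,$$

where $\|A_n \hat{v}\| \le 1$ and

$$\left\|\hat{v}^\top U U^\top - \|U^\top \hat{v}\|^2 \hat{v}^\top\right\|^2
= \|U^\top \hat{v}\|^2 - 2\|U^\top \hat{v}\|^4 + \|U^\top \hat{v}\|^4
= \|U^\top \hat{v}\|^2 \left(1 - \|U^\top \hat{v}\|^2\right).$$

As $\|U^\top \hat{v}\|^2 \le 1$ and $(1 - \|U^\top \hat{v}\|^2) = \Phi_{n-1}^{(v)}$, we have

$$|Z_n| \le 2\gamma_n \sqrt{\Phi_{n-1}^{(v)}}.$$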
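The norm identity used in this bound can be checked numerically. The sketch below is only a sanity check under assumptions: U is drawn as a random matrix with orthonormal columns, $\hat{v}$ and x are random unit vectors, and the helper name `check_lemma3_bound` is hypothetical.

```python
import numpy as np


def check_lemma3_bound(seed=0, d=8, k=3):
    """Return (lhs, rhs, |Z_n| / (2 gamma_n), sqrt(Phi)) for random inputs."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns
    v_hat = rng.standard_normal(d)
    v_hat /= np.linalg.norm(v_hat)                    # unit vector
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                            # so ||x x^T|| = 1
    A = np.outer(x, x)

    s = np.linalg.norm(U.T @ v_hat) ** 2              # ||U^T v_hat||^2
    w = U @ (U.T @ v_hat) - s * v_hat                 # U U^T v_hat - s v_hat
    lhs = np.linalg.norm(w) ** 2
    rhs = s * (1.0 - s)                               # claimed closed form
    z_scaled = abs(w @ A @ v_hat)                     # |Z_n| / (2 gamma_n)
    return lhs, rhs, z_scaled, np.sqrt(1.0 - s)       # last entry: sqrt(Phi)
```

For any seed, the first two returned values should agree up to rounding, and the third should not exceed the fourth.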

B Proof of Lemma 4 


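Assume that the event $\Gamma_0$ holds and consider any
$n \in [n_0, n_1)$. We need the following, which we prove in
Appendix B.1.

Proposition 1. For any $n > m$ and any $v \in \mathbb{R}^k$,

$$\frac{\|U^\top Y_n v\|}{\|Y_n\|} \ge \left(\frac{m}{n}\right)^{3c} \cdot \frac{\|U^\top Y_m v\|}{\|Y_m\|}.$$

From Proposition 1, we know that for any $v \in S$,

$$\frac{\|U^\top Y_n v\|}{\|Y_n v\|} \ge \frac{\|U^\top Y_n v\|}{\|Y_n\|} \ge \left(\frac{n_0}{n}\right)^{3c} \frac{\|U^\top Y_{n_0} v\|}{\|Y_{n_0}\|},$$

where $(n_0/n)^{3c} \ge (n_0/n_1)^{3c} \ge (1/c_1)^{3c}$ for the
constant $c_1$ given in Remark 1. As $Y_{n_0} = Q_{n_0}$ and
$\|Q_{n_0}\| = 1 = \|Q_{n_0} v\|$, we obtain

$$\frac{\|U^\top Y_n v\|}{\|Y_n v\|} \ge \frac{1}{c_1^{3c}} \cdot \frac{\|U^\top Q_{n_0} v\|}{\|Q_{n_0} v\|} \ge \frac{\sqrt{1 - \rho_0}}{c_1^{3c}} = \sqrt{\frac{\bar{c}}{c_1^{6c} k d}}.$$

Therefore, assuming $\Gamma_0$, we always have

$$\Phi_n = \max_v \left(1 - \frac{\|U^\top Y_n v\|^2}{\|Y_n v\|^2}\right) \le 1 - \frac{\bar{c}}{c_1^{6c} k d} = \rho_1.$$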
B.1 Proof of Proposition 1

Recall that for any $n$, $Y_n = Y_{n-1} + \gamma_n x_n x_n^\top Y_{n-1}$ and
$\|x_n x_n^\top\| \le 1$. Then for any $v \in \mathbb{R}^k$,

$$\frac{\|U^\top Y_n v\|}{\|Y_n\|} \ge \frac{\|U^\top Y_{n-1} v\| - \gamma_n \|U^\top Y_{n-1} v\|}{\|Y_{n-1}\| + \gamma_n \|Y_{n-1}\|},$$

which is

$$\frac{1 - \gamma_n}{1 + \gamma_n} \cdot \frac{\|U^\top Y_{n-1} v\|}{\|Y_{n-1}\|} \ge e^{-3\gamma_n} \cdot \frac{\|U^\top Y_{n-1} v\|}{\|Y_{n-1}\|},$$

using the fact that $1 - x \ge e^{-2x}$ for $x \le 1/2$ and $\gamma_n \le 1/2$.
Then by induction, we have

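According to this, we can choose $\alpha_i = (\rho_{i+1} - \rho_i)/2$ and
$\epsilon = \alpha_i \sqrt{1 - \rho_i}/(16 c_1^{6c})$ so that with
$\|u - v\| \le \epsilon$, we have $|\Phi_n^{(v)} - \Phi_n^{(u)}| \le \alpha_i$.
This means that given any $v \in S$ with $\Phi_n^{(v)} \ge \rho_{i+1}$,
there exists some $u \in D_i$ with
$\Phi_n^{(u)} \ge \rho_{i+1} - \alpha_i = \rho_i + \alpha_i$. As a result, we
can now apply a union bound over $D_i$ and have

$$\Pr[\neg\Gamma_{i+1} \mid \Gamma_i] \le \sum_{u \in D_i} \Pr\left[\sup_{n \ge n_i} \Phi_n^{(u)} \ge \rho_i + \alpha_i \,\middle|\, \Gamma_i\right]. \qquad (4)$$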

To bound this further, consider the following two cases. 

First, for the case of i < tti, we have pi >3/4 and iji = 
1 — Pi < 1/4, so that 


Prhr.+ilTi] < Pr 


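$$\frac{\|U^\top Y_n v\|}{\|Y_n\|} \ge e^{-3\sum_{i=m+1}^{n} \gamma_i} \cdot \frac{\|U^\top Y_m v\|}{\|Y_m\|}.$$

The Proposition follows as

$$e^{-3\sum_{i=m+1}^{n} \gamma_i} = e^{-3c\sum_{i=m+1}^{n} 1/i} \ge \left(\frac{m}{n}\right)^{3c},$$

using the fact that $\sum_{i=m+1}^{n} \frac{1}{i} \le \ln\left(\frac{n}{m}\right)$.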
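The two elementary facts behind this chain, the per-step factor $(1-\gamma)/(1+\gamma) \ge e^{-3\gamma}$ and the harmonic-sum bound, can be checked numerically. The sketch below assumes the step size $\gamma_i = c/i$ and requires $m + 1 \ge 2c$ so that every $\gamma_i \le 1/2$; the helper name `prop1_factors` is hypothetical.

```python
import math


def prop1_factors(c, m, n):
    """Return the three quantities compared in the proof, for gamma_i = c / i:
    (product of (1 - g) / (1 + g), exp(-3 * sum of g), (m / n) ** (3 * c)),
    where g ranges over gamma_i for i = m + 1, ..., n."""
    product, total = 1.0, 0.0
    for i in range(m + 1, n + 1):
        g = c / i
        assert g <= 0.5, "the argument needs gamma_i <= 1/2"
        product *= (1.0 - g) / (1.0 + g)
        total += g
    return product, math.exp(-3.0 * total), (m / n) ** (3.0 * c)


# Example: the three values should be non-increasing from left to right.
product, exp_bound, poly_bound = prop1_factors(c=2.0, m=10, n=1000)
```

The polynomial factor $(m/n)^{3c}$ is what makes the bound usable across a whole epoch in Appendix B and Appendix C.2, where the ratio of indices is controlled by the constant $c_1$.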


Pi < Pie '"P = (1 - Pi)e < e < 1 - 377i. 


Then ai > ((1 — 2pi) — (1 — 2>pi)) /2 = pi/2, which is 
at least 12c^/ni_i, as rji > pi > c/{c\‘^kd) and ni_i > 
no = (fk^cP log d for a large enough constant c. There¬ 
fore, we can apply Lemma 10 and the bound in (4) becomes 




<5o 

2(t + l)2- 


C Proof of Lemma 5 


Next, for the case of i > tti, we have pi <3/4 so that 


According to Lemma 3, our $\Phi_n^{(v)}$'s satisfy the same
recurrence relation as the functions $\Psi_n$'s of
Balsubramani et al. (2013). We can therefore have the
following, which we prove in Appendix C.1.

Lemma 10. Let pi = Then for any 

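$v \in S$ and $\alpha_i \ge 12c^2/n_{i-1}$,

$$\Pr\left[\sup_{n \ge n_i} \Phi_n^{(v)} \ge \rho_i + \alpha_i \,\middle|\, \Gamma_i\right] \le e^{-\Omega\left((\alpha_i^2/(c^2\rho_i))\, n_{i-1}\right)}.$$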


Our goal is to bound Pr [^Pi+i iPi], which is 




as Co > 12 by assumption. Since pi+i > Pi/[e^/'^°]^, this 
gives us ai > Pi(|"e®/‘^°]“^ — “^)/2, which is at 

least 12c^/ni_i, as pi, according to our choice, is about 
C 2 (c^fclogni_i)/(ni_i -f 1) for a large enough constant 
C 2 . Thus, we can apply Lemma 10 and the bound in (4) 
becomes 


(Ci/PO e - 2 (* + 1 ) 2 - 


(5) 


This completes the proof of Lemma 5. 


$$\Pr\left[\exists v \in S : \sup_{n_i \le n < n_{i+1}} \Phi_n^{(v)} \ge \rho_{i+1} \,\middle|\, \Gamma_i\right].$$

As discussed before, we cannot directly apply a union
bound on the bound in Lemma 10 as there are infinitely
many $v$'s in $S$. Instead, we look for a small
"$\epsilon$-net" $D_i$ of $S$, with the property that any
$v \in S$ has some $u \in D_i$ with $\|v - u\| \le \epsilon$.
Such a $D_i$ with $|D_i| \le (1/\epsilon)^{O(k)}$ is known to
exist (see e.g. Milman and Schechtman (1986)). Then what
we need is that when $v$ and $u$ are close, $\Phi_n^{(v)}$ and
$\Phi_n^{(u)}$ are close as well. This is guaranteed by the
following, which we prove in Appendix C.2.

Lemma 11. Suppose $\Gamma_i$ happens. Then for any
$n \in [n_i, n_{i+1})$, any $\epsilon \le \sqrt{1 - \rho_i}/(2c_1^{6c})$,
and any $u, v \in S$ with $\|u - v\| \le \epsilon$, we have

$$\left|\Phi_n^{(v)} - \Phi_n^{(u)}\right| \le 16 c_1^{6c} \epsilon / \sqrt{1 - \rho_i}.$$



C.l Proof of Lemma 10 


By Lemma 3, the random variables $\Phi_n^{(v)}$'s satisfy the
same recurrence relation of Balsubramani et al. (2013) for
their random variables $\Psi_n$'s. Thus, we can follow their
analysis¹, but use our better bound on $|Z_n|$, and have the
following.

First, when given $\Gamma_i$, we have
$|Z_n| \le 2\gamma_n\sqrt{\rho_i}$ for $n_{i-1} < n \le n_i$.
Then one can easily modify the analysis in Balsubramani et al.
(2013) to show that for any $t > 0$,


E 


i-n 

e "• Fi 


< exp (tpi + P{6t -I- 2Ppi) 


rii-i 


by noting that (n^ + l)/(ni_i -I- 1) = and n > 

no = &^k^df logd according to our choice of parameters. 

^In particular, their proofs for Lemma 2.9 and Lemma 2.10. 



























Chun-Liang Li, Hsuan-Ten Lin, Chi-Jen Lu 


Next, following Balsubramani et al. (2013) and applying 
Doob’s martingale inequality, we obtain 


Pr 


sup >Pi + ai\Ti 

n>ni 


< E 


exp { -t{pi + ai) + — {6t + 2t^pi) 
n. 


< exp —tai H- {6t + 2Vpi) 

rii-i 

ta, 2c^fpi 

< exp —— +- 

2 Hi-1 


D Proof of Lemma 7 

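As $\cos(U, Q_{i-1})^2 = \frac{1}{1 + \tan(U, Q_{i-1})^2} \ge \frac{1}{1 + \epsilon_{i-1}^2}$, we
have $\|G_i\| \le \Delta\beta_i \le \Delta\cos(U, Q_{i-1})$. Thus, we can apply
Lemma 6 and have

$$\tan(U, AQ_{i-1} + G_i) \le \max(\beta_i, \max(\beta_i, \gamma)\,\epsilon_{i-1}),$$

which is at most $\max(\beta_i, \gamma\epsilon_{i-1}) \le \gamma\epsilon_{i-1} = \epsilon_i$. The
lemma follows as $\tan(U, Q_i) = \tan(U, AQ_{i-1} + G_i)$.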

E Proof of Lemma 8 


72 c' 

rii 

the lemma. 


as ai > Finally, by choosing t = , we have 


C.2 Proof of Lemma 11 

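Assume without loss of generality that
$\Phi_n^{(v)} \le \Phi_n^{(u)}$ (otherwise, we switch $v$ and
$u$), so that

$$\left|\Phi_n^{(v)} - \Phi_n^{(u)}\right| = \frac{\|U^\top Y_n v\|^2}{\|Y_n v\|^2} - \frac{\|U^\top Y_n u\|^2}{\|Y_n u\|^2}.$$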

Let $\rho = \Delta\beta_i$ and note that $\|G_i\| \le \|A - F_i\|$, where $F_i$ is
the average of $|I_i|$ i.i.d. random matrices, each with mean
$A$. Recall that $\|A\| \le 1$ by Assumption 1. Then from a
matrix Chernoff bound, we have

$$\Pr[\|G_i\| > \rho] \le \Pr[\|A - F_i\| > \rho] \le \delta_i,$$

for $|I_i|$ given in (3).

F Proof of Lemma 9 


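As $\|v - u\| \le \epsilon$, we have

$$\frac{\|U^\top Y_n v\|}{\|Y_n v\|} \le \frac{\|U^\top Y_n u\| + \epsilon\|U^\top Y_n\|}{\|Y_n u\| - \epsilon\|Y_n\|}. \qquad (6)$$

To relate this to $\|U^\top Y_n u\|/\|Y_n u\|$, we would like to express
$\|U^\top Y_n\|$ in terms of $\|U^\top Y_n u\|$ and $\|Y_n\|$ in terms of
$\|Y_n u\|$. For this, note that both $\|U^\top Y_n u\|/\|U^\top Y_n\|$
and $\|Y_n u\|/\|Y_n\|$ are at least $\|U^\top Y_n u\|/\|Y_n\|$, which by
Proposition 1 is at least

$$\left(\frac{n_{i-1}}{n}\right)^{3c} \frac{\|U^\top Y_{n_{i-1}} u\|}{\|Y_{n_{i-1}}\|} \ge c_1^{-6c}\, \frac{\|U^\top Y_{n_{i-1}} u\|}{\|Y_{n_{i-1}}\|}, \qquad (7)$$

using the fact that $n_{i-1}/n \ge n_{i-1}/n_{i+1} \ge 1/c_1^2$. Then as
$Y_{n_{i-1}} = Q_{n_{i-1}}$ and $\|Q_{n_{i-1}}\| = \|Q_{n_{i-1}} u\|$, the righthand
side of (7) becomes

$$c_1^{-6c}\, \frac{\|U^\top Q_{n_{i-1}} u\|}{\|Q_{n_{i-1}} u\|} \ge c_1^{-6c} \sqrt{1 - \rho_i},$$

given $\Gamma_i$. What we have obtained so far is a lower bound
for both $\|U^\top Y_n u\|/\|U^\top Y_n\|$ and $\|Y_n u\|/\|Y_n\|$. Plugging
this into (6), with $\hat{\epsilon} = \epsilon c_1^{6c}/\sqrt{1 - \rho_i}$, we get

$$\frac{\|U^\top Y_n v\|}{\|Y_n v\|} \le \frac{\|U^\top Y_n u\|(1 + \hat{\epsilon})}{\|Y_n u\|(1 - \hat{\epsilon})}.$$

As a result, we have

$$\left|\Phi_n^{(v)} - \Phi_n^{(u)}\right| \le \frac{\|U^\top Y_n u\|^2}{\|Y_n u\|^2}\left(\frac{(1 + \hat{\epsilon})^2}{(1 - \hat{\epsilon})^2} - 1\right) \le 16\hat{\epsilon},$$

as

$$\frac{(1 + \hat{\epsilon})^2}{(1 - \hat{\epsilon})^2} - 1 \le \frac{4\hat{\epsilon}}{(1 - \hat{\epsilon})^2} \le 16\hat{\epsilon} \quad \text{for } \hat{\epsilon} \le 1/2.$$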


Let L be the iteration number such that £l-i > £ and < 
£. Note that with £l = eo 7 ^ = £o(l — (A — A)/A)^/^ < 
we can have 


As the number of samples in iteration i is 

17.1 = o I AW(4) < o t iog(di) 


,(A-A)2ft7- VlA-A)"/?? 

the total number of samples needed is 

L / 1 / T^N \ L 




1 


2=1 


2 = 1 


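With $\beta_i = \min(\gamma/\sqrt{1 + \epsilon_{i-1}^2},\ \gamma\epsilon_{i-1})$, one sees that for
some $i_0 \le O(\log d)$, $\beta_i = \gamma/\sqrt{1 + \epsilon_{i-1}^2}$ when $i \le i_0$ and
$\beta_i = \gamma\epsilon_{i-1} = \epsilon_i$ when $i > i_0$. This implies that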

L . io 1 , .9 L 


1 *0 1 -I- ^2 

e4=e^+ e 


1 


where the first sum in the righthand side of ( 8 ) is 


( 8 ) 


7 


^ ^2^22-4 ^ , ^0 


2=1 

while the second sum is 


72 72(1 — 72 ) ’ 


^ 72(i-0 1 

E —=--.< 


+1 £/ (1-7^)4 7^(1-7")£" 
using the fact that $\epsilon_L = \gamma\epsilon_{L-1} > \gamma\epsilon$. Since $\gamma^2 =$
/ -\l/2 

M _ j < 1 — we have and 

since A < 0(A), we also have ^ < 0(1). Moreover, as 

we assume that e < l/y/M, we can conclude that the total 
number of samples needed is at most 


2 = 1 


[ logjdL) 

UA-A)2 


■O 


A 

(A - A)e2 


/ \ \og{dL) \ 

{sHX-XrJ- 



